46 research outputs found

    Simple open stance classification for rumour analysis

    Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable problem-specific features, which significantly boost classifier accuracy and achieve above state-of-the-art results on recent benchmark datasets. This calls into question the value of using complex, sophisticated models for stance classification without first doing informed feature extraction.

    Helping crisis responders find the informative needle in the tweet haystack

    Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis in order to design an effective response. However, with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability).

    Efficient named entity annotation through pre-empting

    Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities - such as names of people, places and organisations - in text. In a document, many segments of text often contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "preempting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus.

    A HMM POS Tagger for Micro-blogging Type Texts

    The high volume of communication via micro-blogging type messages has created an increased demand for text processing tools customised for the unstructured text genre. The performance of available text processing tools developed on structured texts has been shown to deteriorate significantly when they are used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM-based POS (Part-Of-Speech) tagging model customised for unstructured texts. We also evaluated the tagger against published CRF-based state-of-the-art POS tagging models customised for Tweet messages using three publicly available Tweet corpora. Finally, we did cross-validation tests with both taggers by training them on one Tweet corpus and testing them on another.

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.

    The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy

    Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowdsourcing interfaces for NLP classification and selection tasks. The entire workflow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.

    Novel psychoactive substances: An investigation of temporal trends in social media and electronic health records

    Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database. Methods: Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established. Results: Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record followed by a spike in the other three data sources in early 2010, when mephedrone was assigned a ‘class B’ status. Features of current mephedrone users broadly matched those from community studies. Conclusions: Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners.

    UFPRSheffield: Contrasting Rule-based and Support Vector Machine Approaches to Time Expression Identification in Clinical TempEval

    We present two approaches to time expression identification, as entered into SemEval-2015 Task 6, Clinical TempEval. The first is a comprehensive rule-based approach that favoured recall, and which achieved the best recall for time expression identification in Clinical TempEval. The second is an SVM-based system built using readily available components, which was able to achieve a competitive F1 in a short development time. We discuss how the two approaches perform relative to each other, and how characteristics of the corpus affect the suitability of different approaches and their outcomes.

    Analysis of Temporal Expressions Annotated in Clinical Notes

    Annotating the semantics of time in language is important. THYME is a recent temporal annotation standard for clinical texts. This paper examines temporal expressions in the first major corpus released under this standard. It investigates where the standard has proven difficult to apply, and gives a series of recommendations regarding temporal annotation in this important domain.